Predicting CO and NOx emissions from Gas Turbines

1. The combustion of fossil fuels in power plants and vehicles accounts for a major portion of air pollution. Nitrogen oxides (NOx = NO2 + NO) are considered primary atmospheric pollutants, since they are responsible for environmental problems such as photochemical smog, acid rain, tropospheric ozone, ozone layer depletion, and eventually global warming.

2. An important source of harmful pollutants (NOx and CO) released into the atmosphere is the combustion process in the power industry. There is therefore particular concern with reducing emissions from power plants. The EU limits NOx and CO emissions to 25 ppmdv (parts per million by dry volume) when natural gas is used as fuel.

Objective: To determine which factors contribute most to increases in CO and NOx, we examine the correlations between CO, NOx, and the other variables, and then fit a predictive model.

1. Data Exploration

Before doing anything else with the data, let's check whether any of the columns contain null values (missing data).

We have no missing data, so all the entries are valid for use.

Here we get the max, min, mean, and std for all the attributes.
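The two checks above can be sketched with pandas as follows. This is a minimal illustration, not the original notebook code: the DataFrame holds a few made-up rows, and the variable names (df, missing_per_column, summary) are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the gas turbine data; the values are made up
# for illustration and are not taken from the real dataset.
df = pd.DataFrame({
    "AT": [6.9, 6.8, 7.1, 7.4],
    "AP": [1007.9, 1008.4, 1008.8, 1009.2],
    "CO": [3.2, 3.1, 3.3, 3.0],
    "NOX": [82.7, 82.4, 83.0, 82.5],
})

# Count missing values per column; all zeros means no missing data.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# describe() returns count, mean, std, min, quartiles and max per column;
# here only the four statistics mentioned in the text are kept.
summary = df.describe().loc[["max", "min", "mean", "std"]]
print(summary)
```

On the real data, the same two calls would be applied to the full DataFrame after loading it.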

2. Exploratory Data Analysis (EDA)

Some of the features are approximately normally distributed. The features AH, CO, TIT, and TAT exhibit the highest skew coefficients. Moreover, the distributions of carbon monoxide (CO), turbine inlet temperature (TIT), and turbine after temperature (TAT) seem to contain many outliers.
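As a rough sketch of how skew and outliers can be quantified, the snippet below computes the sample skewness of each column and counts outliers with the 1.5 x IQR rule. The IQR rule is an assumption made here for illustration (the text does not say how outliers were identified), and the data values are made up.

```python
import pandas as pd

# Made-up values for illustration only; not taken from the real dataset.
df = pd.DataFrame({
    "AT": [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0],
    "CO": [1.0, 1.2, 1.1, 1.3, 1.2, 1.4, 1.3, 9.5],
})

# Sample skewness per column: near 0 for symmetric data,
# positive when there is a long right tail.
skewness = df.skew()
print(skewness)


def iqr_outlier_count(series):
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (assumed rule)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
    return int(mask.sum())


outliers = {col: iqr_outlier_count(df[col]) for col in df.columns}
print(outliers)
```

In this toy example the single large CO value produces both a high positive skew and one flagged outlier, while the evenly spaced AT column has neither.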

This is a simple plot of each attribute (y = AT, AP, AH, ...) against the observation number.

Pearson's correlation coefficient measures the strength of the linear association between two variables. Its value ranges from -1 to +1: values near +1 or -1 mean the variables are strongly correlated, and 0 means no linear correlation.

A heat map is a two-dimensional representation of information using colors. Heat maps can help the user visualize simple or complex information.

A correlation coefficient measures the strength of the relationship between two variables. The most commonly used one is the Pearson coefficient, which ranges from -1.0 to +1.0. A positive correlation indicates two variables that tend to move in the same direction. A negative correlation indicates two variables that tend to move in opposite directions. As a rule of thumb, a coefficient of -0.8 or lower indicates a strong negative relationship, while a coefficient between -0.3 and 0 indicates a very weak one.
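To make the definition concrete, here is a small pure-Python sketch of the Pearson coefficient applied to toy data. The function name pearson and the sample lists are illustrative assumptions, not part of the original analysis.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
same_direction = [2.0, 4.0, 6.0, 8.0, 10.0]      # rises with x
opposite_direction = [10.0, 8.0, 6.0, 4.0, 2.0]  # falls as x rises

r_pos = pearson(x, same_direction)
r_neg = pearson(x, opposite_direction)
print(r_pos, r_neg)
```

For a whole table, pandas computes the same pairwise coefficients with DataFrame.corr(), whose default method is Pearson; the resulting matrix is what a correlation heat map displays.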

In the correlation matrix above, we can observe a strong negative correlation between CO and the turbine parameters.

NOx correlations:

For NOx, there is a strong negative relationship between AT and NOx, and a mild negative relationship between NOx and AFDP.

TEY correlations:

TEY has a strong positive correlation with CDP, GTEP, and TIT, while CO and TAT have a strong negative correlation with TEY.

The figure above reveals a very strong linear dependency among the input variables, particularly between compressor discharge pressure (CDP) and turbine energy yield (TEY) (0.99), and between CDP and gas turbine exhaust pressure (GTEP) (0.98). GTEP also has a very strong correlation with TEY (0.96). This shows that some of the features may contain redundant information and could therefore be eliminated during model learning. Moreover, the five turbine parameters (GTEP, CDP, AFDP, TIT, and TAT) have stronger correlations with TEY than the three ambient variables (AT, AP, and AH) used as features.
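One possible way to flag such redundant features is to list every pair of columns whose absolute correlation exceeds a cutoff. The sketch below assumes a threshold of 0.95 and uses made-up values; neither comes from the original analysis.

```python
import pandas as pd

# Made-up values: TEY is nearly linear in CDP, AH is unrelated to both.
df = pd.DataFrame({
    "CDP": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0],
    "TEY": [110.0, 121.5, 131.0, 142.5, 153.0, 165.0],
    "AH": [80.0, 60.0, 75.0, 90.0, 65.0, 70.0],
})

corr = df.corr()   # pairwise Pearson correlations
threshold = 0.95   # assumed cutoff for "redundant"

# Visit each unordered pair of columns once and keep the strongly correlated ones.
cols = list(corr.columns)
redundant_pairs = [
    (a, b, float(corr.loc[a, b]))
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(redundant_pairs)
```

Which member of a flagged pair to drop is a modeling decision; the list only points out where the overlap is.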

3. Model Selection and Accuracy

For CO: five turbine variables (GTEP, CDP, AFDP, TIT, and TEY) show the strongest correlations with CO, so we use only these five features.

Multiple linear regression is used to estimate the relationship between two or more independent variables and one dependent variable.

We can use multiple linear regression to find out how strong the relationship is between two or more independent variables and one dependent variable, for example how ambient temperature (AT), ambient pressure (AP), ambient humidity (AH), air filter difference pressure (AFDP), gas turbine exhaust pressure (GTEP), turbine inlet temperature (TIT), turbine after temperature (TAT), compressor discharge pressure (CDP), carbon monoxide (CO), and nitrogen oxides (NOx) affect turbine energy yield (TEY). We can also use it to find the value of the dependent variable at given values of the independent variables, for example the expected TEY at certain levels of AT, AP, AH, AFDP, GTEP, TIT, TAT, CDP, CO, and NOx.

Multiple linear regression formula

y = B0 + B1*X1 + ... + Bn*Xn

y = the predicted value of the dependent variable

B0 = the y-intercept (the value of y when all independent variables are set to 0)

B1*X1 = the regression coefficient (B1) of the first independent variable (X1), i.e. the effect that increasing the value of that variable has on the predicted y value

... = the same for however many independent variables you are testing

Bn*Xn = the regression coefficient (Bn) of the last independent variable (Xn)
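As a small worked example of the formula, the snippet below evaluates a prediction by hand for two independent variables. The intercept, coefficients, and feature values are arbitrary illustrative numbers, not fitted values from the dataset.

```python
def predict(intercept, coefficients, features):
    """Evaluate y = B0 + B1*X1 + ... + Bn*Xn for one observation."""
    return intercept + sum(b * x for b, x in zip(coefficients, features))


b0 = 2.0          # B0, the intercept
b = [0.5, -1.0]   # B1 and B2
x = [4.0, 1.5]    # X1 and X2 for one observation

# 2.0 + 0.5*4.0 + (-1.0)*1.5 = 2.5
y = predict(b0, b, x)
print(y)
```

Setting every feature to 0 returns the intercept, which matches the definition of B0 above.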

We have divided the data into “attributes” and “labels”.

Attributes are the independent variables while labels are dependent variables whose values are to be predicted.

We assign 80% of the data to the training set and 20% to the test set. The test_size parameter specifies the proportion of the test set. Setting random_state fixes the shuffling used by train_test_split, so repeated runs of the model produce the same training and test sets.
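A minimal sketch of the attribute/label separation and the 80/20 split with scikit-learn is shown below. The data values, the two chosen feature columns, and random_state=0 are assumptions for illustration; the original notebook's values are not given in the text.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows standing in for the real dataset.
df = pd.DataFrame({
    "GTEP": [19.6, 19.7, 19.8, 23.9, 24.1, 25.6, 26.0, 29.8, 30.1, 33.4],
    "TIT": [1059.0, 1060.0, 1061.0, 1085.0, 1086.0, 1091.0, 1092.0,
            1099.0, 1100.0, 1100.0],
    "CO": [3.2, 3.1, 3.0, 2.1, 2.0, 1.6, 1.5, 1.1, 1.0, 0.9],
})

X = df[["GTEP", "TIT"]]  # attributes (independent variables)
y = df["CO"]             # labels (dependent variable to predict)

# test_size=0.2 keeps 20% of the rows for testing; random_state seeds the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Repeating the call with the same random_state reproduces the same split.
X_train_again, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(len(X_train), len(X_test))
```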

After splitting the data into training and testing sets, we have to train our algorithm. For that, we import the LinearRegression class, instantiate it, and call the fit() method with our training data.

Next, we print the intercept and coefficients of the model.

As discussed, the linear regression model finds the best values for the intercept and the slopes, which results in the fit that best matches the data. The fitted model gives the intercept and slope values calculated by the linear regression algorithm for our dataset.
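The training step and the fitted parameters can be sketched as follows. The data here is synthetic and noise-free, generated from y = 1 + 2*x1 + 3*x2, so the recovered intercept and coefficients are known in advance; the variable name regressor is an assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data generated exactly from y = 1 + 2*x1 + 3*x2.
X_train = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [2.0, 1.0],
])
y_train = 1.0 + 2.0 * X_train[:, 0] + 3.0 * X_train[:, 1]

# Instantiate the model and fit it to the training data.
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# intercept_ holds B0; coef_ holds one slope per feature column.
print(regressor.intercept_)
print(regressor.coef_)

# Predict for a new observation: 1 + 2*3 + 3*2 = 13
y_new = regressor.predict(np.array([[3.0, 2.0]]))
print(y_new)
```

On real, noisy data the fitted values would only approximate the underlying relationship; the exact recovery here is a property of the noise-free toy data.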

The plot above shows that the predicted CO is approximately close to the actual CO; without the outliers, the points would lie closer to a straight line.

We obtain a score (referred to as accuracy here) of 72%. For scikit-learn's LinearRegression, the score() method returns R-squared.

The mean absolute error (MAE) is the average of the absolute differences between the actual and predicted values in the dataset. It measures the average magnitude of the residuals.

R-squared (R2) is a statistical measure that represents the proportion of the variance for a dependent variable that's explained by an independent variable or variables in a regression model.

The mean squared error (MSE) is the average of the squared differences between the actual and predicted values in the dataset. It equals the variance of the residuals when their mean is zero.
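The three metrics can be computed by hand as in the sketch below, which uses toy actual and predicted values rather than results from the model. The function names mae, mse, and r2 are illustrative.

```python
def mae(actual, predicted):
    """Mean absolute error: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean squared error: average of (actual - predicted)^2."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r2(actual, predicted):
    """R-squared: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.0, 8.0, 9.5]

print(mae(actual, predicted))  # (0.5 + 0 + 1 + 0.5) / 4 = 0.5
print(mse(actual, predicted))  # (0.25 + 0 + 1 + 0.25) / 4 = 0.375
print(r2(actual, predicted))   # 1 - 1.5 / 20 = 0.925
```

scikit-learn provides the same quantities as mean_absolute_error, mean_squared_error, and r2_score in sklearn.metrics.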

Now the same process for NOx:

According to the correlation matrix, NOx is most strongly correlated with ambient temperature (AT), and the correlation is negative. This suggests that operation at higher ambient temperatures is associated with lower NOx emissions.

Features used for NOx: AT, AFDP, AH, TEY, AP

As you can see, the predicted NOx deviates from the actual NOx more than in the CO case, which indicates lower accuracy even before calculating it.

Hence, the accuracy for NOx using linear regression is not as good as for CO.

Accuracy: CO: 72%, NOx: 41.6%

Conclusion:

Linear regression therefore predicts CO more accurately than NOx, and another method or algorithm could be tried for a better outcome. Alternatives such as Random Forest, SVM, or Decision Tree may give better results.

Reference:

Heysem KAYA, Pınar TÜFEKCİ, Erdinç UZUN. Predicting CO and NOx emissions from gas turbines: novel data and a benchmark PEMS. Turkish Journal of Electrical Engineering & Computer Sciences. Department of Computer Engineering, Çorlu Faculty of Engineering, Namık Kemal University, Tekirdağ, Turkey.